Chapter 8: GeoPandas Python Library#
Managing, analyzing and visualizing vector data#
Learning Objectives:#
GeoPandas GeoDataFrame
Advantages/disadvantages
Data checks and exploration
Spatial attributions and projecting
Interactive and static plotting
Spatial analysis
Geometric manipulations
Set-Operations
Subsetting and (spatial) aggregation
Integration with Rasterio
1. Set up Miniconda#
Usefull resources:#
Conda Cheatsheet: https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf
The geospatial package! Use Mamba to install… https://geospatial.gishub.org/installation/
Geospatial packages used today:#
Geopandas
Folium
Rasterio
Rasterstats
2. What is GeoPandas?#
GeoPandas is a crucial and extensive library used in the field of geography and GIS (Geographical Information Systems).
GeoSeries - the ‘geometry’ column#

Just an extension to Pandas?#
GeoPandas indeed builds on the capabilities of Pandas, but there’s much more to it.
GeoDataFrames look the same:
GeoDataFrames look similar to the conventional Pandas DataFrame, complete with rows and columns that can be manipulated using typical DataFrame operations.
Fully integrates with Pandas’ functions:
GeoPandas not only adopts the appearance of Pandas but also integrates fully with its functions. You can apply regular Pandas methods to GeoDataFrames
Differences with Pandas:#
A GeoSeries which stores geometries and CRS (Coordinate Reference System): Unlike a regular Pandas Series, a GeoSeries stores spatial objects and can include CRS information, defining how the coordinates relate to locations on the Earth’s surface.
Exploring geospatial data: With GeoPandas, quick visualizations of spatial data are possible, allowing a more intuitive understanding of geographic relationships.
Most typical GIS operations available: From spatial joins to overlays, GeoPandas offers the tools to perform most common GIS operations, something beyond the capabilities of standard Pandas.
Advantages#
Spatial indexing - spatial selections are super fast
The whole Python data science stack is available to you!!
Disadvantages#
Limited 3D Support: GeoPandas primarily focuses on 2D geometries. For applications requiring extensive 3D support, other specialized libraries might be more suitable.
Geometric objects#
GeoPandas represents geometry objects in text format. Note, it uses the package Shapely to interpret these objects.
Three main geometric object types:

The following example are geometric representations of a polygon and a multipolygon:
Coordinates: The series of coordinate pairs inside the double parentheses define the vertices of the polygon.
POLYGON: indicates that the geometry is representing a polygon. A polygon is a two-dimensional shape with straight sides, often used in GIS and spatial analysis to define areas or regions.
MULTIPOLYGON: indicates that the geometry consists of multiple polygons. Each polygon is a collection of linear rings that form a closed loop. MULTIPOLYGONs are used to describe complex features consisting of several non-connected parts. (The MULTIPOLYGON below contains two separate polygons, each enclosed in double parentheses)
‘POLYGON ((-123.12439615721482 49.253389655118525, -123.12216852657728 49.25334321506493,
-123.12035729640567 49.25330538420304, -123.12038689523503 49.252491276653686,
-123.12041649318077 49.25167719596781, -123.12222696366466 49.25171528756164,
-123.12446759063558 49.25176233315163, -123.124431874511 49.25257599869532,
-123.12439615721482 49.253389655118525))’
‘MULTIPOLYGON ((-123.12439615721482 49.253389655118525, -123.12216852657728 49.25334321506493,
-123.12035729640567 49.25330538420304, -123.12038689523503 49.252491276653686,
-123.12041649318077 49.25167719596781, -123.12222696366466 49.25171528756164,
-123.12446759063558 49.25176233315163, -123.124431874511 49.25257599869532,
-123.12439615721482 49.253389655118525)),
((-123.12439615721482 49.253389655118525, -123.12216852657728 49.25334321506493, -123.12035729640567 49.25330538420304, -123.12439615721482 49.253389655118525)))
3. Exploring GeoDataFrames#
Data sets#
From City of Vancouver’s Open Data Portal - https://opendata.vancouver.ca/
Dataset |
Type |
File name |
Description |
|---|---|---|---|
Street trees |
Points |
street-trees.geojson |
https://opendata.vancouver.ca/explore/dataset/street-trees/information/ |
Parks |
Polygons |
parks-polygon.geojson |
https://opendata.vancouver.ca/explore/dataset/parks-polygon-representation/information/ |
Local areas |
Polygons |
local-area-boundary.geojson |
https://opendata.vancouver.ca/explore/dataset/local-area-boundary/information/ |
Disability Parking |
Text |
disability-parking.geojson |
https://opendata.vancouver.ca/explore/dataset/disability-parking/information/ |
Pandas functions#
# pip install folium
import geopandas as gpd
import pandas as pd
import numpy as np
import folium
from folium import plugins
import matplotlib.pyplot as plt
# Load all required spatial datasets
gpdTrees = gpd.read_file('week11_data/street-trees.geojson')
gpdParks = gpd.read_file('week11_data/parks-polygon.geojson')
gpdParking = gpd.read_file('week11_data/disability-parking.geojson')
gpdAreas = gpd.read_file('week11_data/local-area-boundary.geojson') # local areas
# Many file formats are supported (https://geopandas.org/en/stable/docs/user_guide/io.html)
# - Shapefiles
# - Zipped Shapefiles
# - GeoPackage database files
# - KML / KMZ
# - NO ESRI geodatabases!!
---------------------------------------------------------------------------
CPLE_OpenFailedError Traceback (most recent call last)
fiona/ogrext.pyx in fiona.ogrext.gdal_open_vector()
fiona/_err.pyx in fiona._err.exc_wrap_pointer()
CPLE_OpenFailedError: week11_data/street-trees.geojson: No such file or directory
During handling of the above exception, another exception occurred:
DriverError Traceback (most recent call last)
/var/folders/qt/qy37fy6s3c1553y2k74063pr0000gn/T/ipykernel_17362/3198774724.py in <module>
10 # Load all required spatial datasets
11
---> 12 gpdTrees = gpd.read_file('week11_data/street-trees.geojson')
13 gpdParks = gpd.read_file('week11_data/parks-polygon.geojson')
14 gpdParking = gpd.read_file('week11_data/disability-parking.geojson')
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/geopandas/io/file.py in _read_file(filename, bbox, mask, rows, engine, **kwargs)
257
258 if engine == "fiona":
--> 259 return _read_file_fiona(
260 path_or_bytes, from_bytes, bbox=bbox, mask=mask, rows=rows, **kwargs
261 )
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/geopandas/io/file.py in _read_file_fiona(path_or_bytes, from_bytes, bbox, mask, rows, where, **kwargs)
301
302 with fiona_env():
--> 303 with reader(path_or_bytes, **kwargs) as features:
304 crs = features.crs_wkt
305 # attempt to get EPSG code
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/fiona/env.py in wrapper(*args, **kwds)
455
456 with env_ctor(session=session):
--> 457 return f(*args, **kwds)
458
459 return wrapper
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/fiona/__init__.py in open(fp, mode, driver, schema, crs, encoding, layer, vfs, enabled_drivers, crs_wkt, allow_unsupported_drivers, **kwargs)
334
335 if mode in ("a", "r"):
--> 336 colxn = Collection(
337 path,
338 mode,
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/fiona/collection.py in __init__(self, path, mode, driver, schema, crs, encoding, layer, vsi, archive, enabled_drivers, crs_wkt, ignore_fields, ignore_geometry, include_fields, wkt_version, allow_unsupported_drivers, **kwargs)
241 if self.mode == "r":
242 self.session = Session()
--> 243 self.session.start(self, **kwargs)
244 elif self.mode in ("a", "w"):
245 self.session = WritingSession()
fiona/ogrext.pyx in fiona.ogrext.Session.start()
fiona/ogrext.pyx in fiona.ogrext.gdal_open_vector()
DriverError: week11_data/street-trees.geojson: No such file or directory
# Pandas methods for DataFrame exploration can be used!
gpdTrees.head()
| diameter | genus_name | common_name | cultivar_name | species_name | assigned | root_barrier | plant_area | curb | std_street | date_planted | on_street | tree_id | civic_number | neighbourhood_name | height_range_id | on_street_block | street_side_name | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.00 | PRUNUS | KWANZAN FLOWERING CHERRY | KWANZAN | SERRULATA | N | N | 7 | Y | W 3RD AV | NaN | W 3RD AV | 2504 | 1701 | FAIRVIEW | 3 | 1700 | ODD | POINT (-123.14337 49.26891) |
| 1 | 12.00 | THUJA | PYRAMIDAL ARBORVITAE | PYRAMIDALIS | OCCIDENTALIS | N | N | B | Y | W 3RD AV | NaN | W 3RD AV | 2505 | 1701 | FAIRVIEW | 3 | 1700 | ODD | POINT (-123.14368 49.26892) |
| 2 | 13.75 | PRUNUS | KWANZAN FLOWERING CHERRY | KWANZAN | SERRULATA | N | N | 7 | Y | W 3RD AV | NaN | W 3RD AV | 2517 | 1760 | FAIRVIEW | 3 | 1700 | EVEN | POINT (-123.14449 49.26883) |
| 3 | 15.00 | PRUNUS | KWANZAN FLOWERING CHERRY | KWANZAN | SERRULATA | N | N | 7 | Y | W 3RD AV | NaN | W 3RD AV | 2518 | 1760 | FAIRVIEW | 3 | 1700 | EVEN | POINT (-123.14482 49.26884) |
| 4 | 26.00 | PRUNUS | KWANZAN FLOWERING CHERRY | KWANZAN | SERRULATA | N | N | 6 | Y | W 3RD AV | NaN | W 3RD AV | 2522 | 1780 | FAIRVIEW | 3 | 1700 | EVEN | POINT (-123.14515 49.26885) |
# Let's check out the very powerful method '.info()'
gpdTrees.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 149930 entries, 0 to 149929
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 diameter 149930 non-null float64
1 genus_name 149930 non-null object
2 common_name 149930 non-null object
3 cultivar_name 79512 non-null object
4 species_name 149930 non-null object
5 assigned 149930 non-null object
6 root_barrier 149930 non-null object
7 plant_area 148447 non-null object
8 curb 149930 non-null object
9 std_street 149930 non-null object
10 date_planted 67628 non-null object
11 on_street 149930 non-null object
12 tree_id 149930 non-null object
13 civic_number 149930 non-null object
14 neighbourhood_name 149930 non-null object
15 height_range_id 149930 non-null int64
16 on_street_block 149930 non-null object
17 street_side_name 149930 non-null object
18 geometry 134199 non-null geometry
dtypes: float64(1), geometry(1), int64(1), object(16)
memory usage: 21.7+ MB
So there are 149930 street trees.
Missing data:
134199 records include geometry
cultivar_name provided for 79512 records
date_planted provided for 67628 records
“A plant cultivar refers to a variation within a plant species that has been developed by a human horticulturist through controlled plant breeding, as opposed to occurring naturally”
# Output some basic stats
gpdTrees.describe()
| diameter | height_range_id | |
|---|---|---|
| count | 149930.000000 | 149930.000000 |
| mean | 11.935306 | 2.662149 |
| std | 9.210158 | 1.552271 |
| min | 0.000000 | 0.000000 |
| 25% | 4.000000 | 1.000000 |
| 50% | 9.500000 | 2.000000 |
| 75% | 17.000000 | 4.000000 |
| max | 305.000000 | 10.000000 |
Subsetting and Aggregating GeoDataFrames#
Subsetting GeoDataFrames works exactly the same as with normal Pandas DataFrames!
The following example identifies and extracts the rows from a GeoPandas GeoDataFrame where the spatial geometry information (in the ‘geometry’ column) is missing, and display the last five rows.
.isna(): returns a Boolean Series, with True for every row where the ‘geometry’ value is NaN (Not a Number) or NA (Not Available). In other words, it identifies rows where the spatial information is missing.
# We discovered entries without gemeometry, lets explore...
selection = gpdTrees.loc[gpdTrees['geometry'].isna()]
selection.tail()
| diameter | genus_name | common_name | cultivar_name | species_name | assigned | root_barrier | plant_area | curb | std_street | date_planted | on_street | tree_id | civic_number | neighbourhood_name | height_range_id | on_street_block | street_side_name | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 149919 | 4.0 | CRATAEGUS | LAVALLEI HYBRID HAWTHORN | NaN | LAVALLEI X | N | N | 4 | Y | TRAFALGAR ST | NaN | TRAFALGAR ST | 284870 | 2706 | KITSILANO | 1 | 2700 | EVEN | None |
| 149920 | 9.0 | PRUNUS | RANCHO SARGENT CHERRY | RANCHO | SARGENTII | N | N | 4 | Y | TRAFALGAR ST | NaN | TRAFALGAR ST | 284871 | 2706 | KITSILANO | 1 | 2700 | EVEN | None |
| 149921 | 6.0 | CRATAEGUS | LAVALLEI HYBRID HAWTHORN | NaN | LAVALLEI X | N | N | 4 | Y | TRAFALGAR ST | NaN | TRAFALGAR ST | 284872 | 2706 | KITSILANO | 1 | 2700 | EVEN | None |
| 149926 | 42.0 | CEDRUS | DEODAR CEDAR | NaN | DEODARA | N | N | N | Y | VIVIAN DRIVE | NaN | VIVIAN DRIVE | 285245 | 7906 | KILLARNEY | 7 | 7900 | EVEN | None |
| 149928 | 16.0 | PRUNUS | JAPANESE FLOWERING CHERRY | NaN | SERRULATA | N | N | 8 | Y | E 11TH AV | NaN | E 11TH AV | 285346 | 1457 | KENSINGTON-CEDAR COTTAGE | 4 | 1400 | ODD | None |
The example below first filters the data frame by Neighbourhood: gpdTrees[‘neighbourhood_name’]==’DOWNTOWN’ filters the rows in the GeoDataFrame where the ‘neighbourhood_name’ column has the value ‘DOWNTOWN’. And gpdTrees[‘neighbourhood_name’]==’WEST POINT GREY’ filters the rows in the GeoDataFrame where the ‘neighbourhood_name’ column has the value ‘WEST POINT GREY’.
After filtering the rows, loc[] is used to select the ‘height_range_id’ column for those filtered rows.
The .mode() method calculates the mode of the ‘height_range_id’ values for the ‘DOWNTOWN’ neighborhood and ‘WEST POINT GREY’ neighborhood. The mode is the value that appears most frequently in a data set, so it gives you the most common height range in that neighborhood.
# Let's check tree height
print(gpdTrees.loc[gpdTrees['neighbourhood_name']=='DOWNTOWN', 'height_range_id'].mode())
print(gpdTrees.loc[gpdTrees['neighbourhood_name']=='WEST POINT GREY', 'height_range_id'].mode())
# Description 'height_range_id':
# values 0 - 10 for every 10 feet (e.g., 0 = 0-10 ft, 1 = 10-20 ft, 2 = 20-30 ft, ..., and 10 = 100+ ft)
0 2
Name: height_range_id, dtype: int64
0 1
Name: height_range_id, dtype: int64
Also aggregation works the same!
df.groupby() is the basic aggregation methods used to compute within group statistics.
Subsetting and aggregating GeoDataFrames are essential techniques for managing and analyzing geospatial data.
Subsetting allows you to focus on specific parts of the data that meet certain criteria, while aggregation lets you summarize and understand patterns within the data. With aggregation, you can also calculate summary statistics for each group, such as mean, sum, count, etc.
These operations are powerful tools for data cleaning, exploration, and analysis, making them fundamental to working with geospatial data in GeoPandas.
Image from https://pandas.pydata.org/
This code below:
Groups the rows in the GeoDataFrame gpdTrees by the unique values in the ‘neighbourhood_name’ column. Essentially, it’s segmenting the data by neighborhood, treating each distinct neighborhood as a separate group.
After grouping the data by neighborhood, selects the ‘diameter’ column.
Finally, .mean() calculates the mean (average) of the ‘diameter’ values within each neighborhood group.
The result will be a pandas Series with the neighborhood names as the index and the mean diameters as the values.
gpdTrees.groupby('neighbourhood_name')['diameter'].mean()
neighbourhood_name
ARBUTUS-RIDGE 12.281992
DOWNTOWN 8.063486
DUNBAR-SOUTHLANDS 14.677531
FAIRVIEW 11.306774
GRANDVIEW-WOODLAND 12.788890
HASTINGS-SUNRISE 11.680582
KENSINGTON-CEDAR COTTAGE 11.957711
KERRISDALE 12.790653
KILLARNEY 10.104743
KITSILANO 15.035828
MARPOLE 11.728986
MOUNT PLEASANT 12.075201
OAKRIDGE 10.376248
RENFREW-COLLINGWOOD 9.639552
RILEY PARK 11.917149
SHAUGHNESSY 14.817716
SOUTH CAMBIE 12.042835
STRATHCONA 10.154407
SUNSET 10.820122
VICTORIA-FRASERVIEW 10.394755
WEST END 12.464704
WEST POINT GREY 13.539662
Name: diameter, dtype: float64
Interactive mapping!#
Create interactive map with two layers - polygons and points#
# pip install folium matplotlib mapclassify
#### Generate an interactive leaflet map based on GeoDataFrame
# Learn more about .explore() # https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.explore.html
# First subsetting...
# 1. Remove records with NaN in 'geometry'
gpdTrees2 = gpdTrees.dropna(subset='geometry')
# Learn more about functions like '.dropna()' here:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
# 2. Remove Stanley Park
gpdParks2 = gpdParks.loc[gpdParks['park_name']!='Stanley Park', :]
# Learn more about .loc and .iloc here:
# https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html
Here, the code below creates an interactive leaflet map “m” using the explore method. The parks are colored based on the ‘area_ha’ column, using the “viridis” colormap. Since categorical=False, the coloring is continuous rather than discrete.
Then the code filters gpdTrees2 to only include rows where the ‘common_name’ column contains ‘CHERRY’. Then it adds these cherry trees as another layer to the existing map m, setting specific styling properties.
# 3. Add Parks layer... We can color by variable, such as area, creating a choropleth
m = gpdParks2.explore(column="area_ha",
categorical=False,
cmap="viridis")
cherryTrees = gpdTrees2.loc[gpdTrees2['common_name'].str.contains('CHERRY'),: ]
#. 4 Add another layer to mapping object
cherryTrees.explore(m=m,
popup=False,
style_kwds={'stroke':False,
'fillOpacity':0.1,
'fillColor':'forestgreen'})
# 5. Display in Jupyter
m